Distributed Machine Learning with Communication Constraints
نویسنده
چکیده
Distributed Machine Learning with Communication Constraints by Yuchen Zhang Doctor of Philosophy in Computer Science University of California, Berkeley Professor Michael I. Jordan, Co-chair Professor Martin J. Wainwright, Co-chair Distributed machine learning bridges the traditional fields of distributed systems and machine learning, nurturing a rich family of research problems. Classical machine learning algorithms process the data by a single-thread procedure, but as the scale of the dataset and the complexity of the models grow rapidly, it becomes prohibitively slow to process on a single machine. The usage of distributed computing involves several fundamental trade-offs. On one hand, the computation time is reduced by allocating the data to multiple computing nodes. But since the algorithm is parallelized, there are compromises in terms of accuracy and communication cost. Such trade-offs puts our interests in the intersection of multiple areas, including statistical theory, communication complexity theory, information theory and optimization theory. In this thesis, we explore theoretical foundations of distributed machine learning under communication constraints. We study the trade-offs between communication and computation, as well as the trade-off between communication and learning accuracy. In particular settings, we are able to design algorithms that don’t compromise on either side. We also establish fundamental limits that apply to all distributed algorithms. In more detail, this thesis makes the following contributions: • We propose communication-efficient algorithms for statistical optimization. These algorithms achieve the best possible statistical accuracy and suffer the least possible computation overhead. • We extend the same algorithmic idea to non-parametric regression, proposing an algorithm which also guarantees the optimal statistical rate and superlinearly reduces the computation time. • In the general setting of regularized empirical risk minimization, we propose a distributed optimization algorithm whose communication cost is independent of the data size, and is only weakly dependent on the number of machines.
منابع مشابه
Network Constrained Distributed Dual Coordinate Ascent for Machine Learning
With explosion of data size and limited storage space at a single location, data are often distributed at different locations. We thus face the challenge of performing largescale machine learning from these distributed data through communication networks. In this paper, we study how the network communication constraints will impact the convergence speed of distributed machine learning optimizat...
متن کاملFundamental Limits of Online and Distributed Algorithms for Statistical Learning and Estimation
Many machine learning approaches are characterized by information constraints onhow they interact with the training data. These include memory and sequential accessconstraints (e.g. fast first-order methods to solve stochastic optimization problems);communication constraints (e.g. distributed learning); partial access to the underlyingdata (e.g. missing features and multi-ar...
متن کاملTowards Geo-Distributed Machine Learning
Latency to end-users and regulatory requirements push large companies to build data centers all around the world. The resulting data is “born” geographically distributed. On the other hand, many machine learning applications require a global view of such data in order to achieve the best results. These types of applications form a new class of learning problems, which we call Geo-Distributed Ma...
متن کاملA Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees
This paper motivates and precisely formulates the problem of learning from distributed data; describes a general strategy for transforming traditional machine learning algorithms into algorithms for learning from distributed data; demonstrates the application of this strategy to devise algorithms for decision tree induction from distributed data; and identifies the conditions under which the al...
متن کاملGlobal Warming: New Frontier of Research Deep Learning- Age of Distributed Green Smart Microgrid
The exponential increase in carbon-dioxide resulting Global Warming would make the planet earth to become inhabitable in many parts of the world with ensuing mass starvation. The rise of digital technology all over the world fundamentally have changed the lives of humans. The emerging technology of the Internet of Things, IoT, machine learning, data mining, biotechnology, biometric, and deep le...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016